knitr::opts_chunk$set(warning=FALSE, message=FALSE)

1 Welcome

In this write-up I will be going over a small Analysis of the COVID-19 data, specifically a look into Italys’ Confirmed Deceased Cases. Then I will hand-build, and implement, a Monte Carlo Algorithm, using a Markov Chain Technique. This will help build information and give insight into how exactly this virus will progress, and can lead to knowledge about how COVID-19 is spreading in the communities.

For all the graphs, please know that that they are interactive! You can deselect anything in the legends, or mouse over them all for more detailed information!

DATA GATHERED FROM:

https://data.humdata.org/dataset/novel-coronavirus-2019-ncov-cases

1.1 What is a Markov Chain Monte Carlo Algorithm

A Morkov Chain Monte Carlo Algorithm follows the basic concept that all Monte Carlo Algorithms have:

  • It simulates a bunch of paths, or ‘walks’, that can occur.
    • Essentially, it is a good technique at exploring every option, and finding the likelyhood of something happening.
  • It creates the paths based on the previous data, generating a random number based on the probability of an event occuring.
    • In this case, the Algorithm I will build will generate a random number based on the previous Changes in the Rate of Change for the Deceased Cases.
  • Monte Carlo Algorithms are generally simulated thousands of times.
    • I will only be running a max of 100 trials, this is due to computer power, as well as ease of analysis for this write-up.

Now, lets dig into how a Markov Chain operates:

  • It uses a set window to judge the next random number off of.
    • This can be pictured as a revolving door, or a first-in first-Out line.
  • Essentially, you start with the set window, say the past 8 days rate of change value, then after you predict the next day you drop the oldest rate out.
    • This constantly limits the Algorithm to predict a random number based on the chosen window, in this case 8 days.
      • However, the Algorithm will be implemented with multiple windows.

This technique is useful for a few reasons,

  • It allows the Algorithm to converge to a number, based on the previous data.
    • This is an attempt to mimic what will actually happen over time.
  • It allows for a full exploration of the possible paths,
    • this is especially true as the number of simulation paths increases.
  • With enough simulations, one can deduce the likelyhood of certain aspects happening.
    • For example, what is the most likely number the rate of change will converge to.

1.2 Cleaning the data

Here I am doing a few things:

  • I made a row for each status, including a new active status

    • \(Acive = Confirmed - (Deaths + Recovered)\)
  • I got rid of all columns except Country, Status, Value, Date

  • I added a new region called “Global”, this holds all the values for every country combined

options(stringsAsFactors=FALSE)
library(plyr)
library(dplyr)
library(reshape2)
library(ggplot2)
library(plotly)
library(highcharter)

#function for grabbing status from my filenames
getStatus<- function(x){
  strsplit(x, "-")[[1]] %>%
    last() %>%
    gsub(pattern="\\.csv", replacement="")
}

#function created for adding an active status
createActive <- function(x){
  dcast(Country.Region + Date ~ Status,
        data=x, value.var="Value") %>%
    mutate(Active = Confirmed - (Deaths + Recovered)) %>% 
    melt(id = c("Country.Region", "Date"))
}

#function used to convert a dataset from long to wide
convert <- function(x){ dcast(Country + Date ~ Status,
        data=x, value.var="Value")}

###########################################
### This is where I start cleaning the data
files <- list.files("raw", full.names = TRUE)

raw <- files %>% #read in files
  lapply(function(x){
  read.csv(x) %>% 
    mutate(
      Date = as.Date(Date, "%m/%d/%Y"),
      Status = getStatus(x) #add status
    )
}) %>% 
  bind_rows() %>% #combine files
  subset(!(Value==0) )

raw <- raw %>% #combine countries into one
  group_by(Country.Region, Date, Status) %>% 
  summarise(
    Value = sum(Value)
  )

raw <- raw %>% #get global stats
  group_by(Date, Status) %>% 
  summarise(
    Value = sum(Value)
  ) %>% 
  ungroup() %>% 
  mutate(
    Country.Region = "Global"
  ) %>% 
  bind_rows(raw)

raw <- raw %>% #create active columns, delete nulls, rename
  createActive() %>% 
  subset(!is.na(value)) %>% 
  rename(
   Country = Country.Region,
    Value = value,
    Status = variable
  )

total <- raw %>% #Get total values for seperate df
  group_by(Country, Status) %>% 
  summarise(
    Value = max(Value)
  ) %>% 
  ungroup() %>% 
  mutate(
    Status = as.character(Status)
  )

#get change per day
raw <- raw %>%
  group_by(Country, Status) %>% 
  mutate(
    Change =  Value - lag(Value, k=1),
    Rate_Change =  (Value - lag(Value, k=1))/lag(Value, k=1) 
    ) %>% 
  ungroup() %>% 
  mutate(
    Status = as.character(Status)
  )

Now that the data is clean, lets start digging into the data!

2 A Quick Analysis

Lets start with looking at the overall feel of the top 5 Countries, in terms of cases.

2.1 Cumulative Total Cases

The above graph shows the differences of cumulative cases by type, and by country. Some pretty interesting things can be found here. For example, Italy has the highest deaths and active cases, yet they are only second in the number of confirmed cases. This could mean a few things:

  • Italy may not have been testing enough people, which would start to bring their Mortality Rate down.

  • Italys’ healthcare system is getting overrun, and their deaths are about to skyrocket.

This graph perfectly shows just how many Deaths Italy has compared to the other four countries. At the time of writing this, Italy has almost more Deaths than the other 4 countries combined!

For the remainder of this analysis, I will be focusing on Italy.

2.2 Rates of Change

Theres a couple things to keep in mind when looking at the rate of change. For example,

  • It has more variance in the beginning.

  • As the number of cases get larger, the rate will tend to even out, or even decline.

This is the basic Exponential growth function:

  • \(f(t,r)=P_0(1+r)^t\)

The rate is what gets exponentiated for the function, ie, if the rate of change is higher, the confirmed cases will get much higher with each day. Also, if the rate of change is lower that means the cases per day has dropped from the previous days’ number. If there starts to be a trend of consecutively lower rates of change, this means that the disease is starting to slow down, which can be indicitive of being past the peak.

Remember: You can deselect the different Rates in order to get a closer look.

This graph plainly shows that the Deaths rate of change has not only a higher average, but most of its’ rates are higher than the rate of confirmed. This tells us the number of deaths are going up much faster than the confirmed cases are.

There are a multitude of reasons why this could happen:

  • The Confirmed cases are lagging behind, due to the incubation period.

  • The Confirmed cases are innacurate, as finding this true value is difficult.

  • The Mortality rate is not a static number,

    • this is a likely scenario for a few reasons,
      • the healthcare systems can be overrun, causing more deaths.
      • the population ages vary depending on location, and COVID-19 is known to have a higher Mortality rate among an older population, -as well as much more unknown factors.
  • The Rates of change will tend to even out over time, while will really be shown in the Changes of the Rates themselves.

    • Meaning in the beginning stages of the rate of change, it will start with a lot of extreme jumps.
      • For example, if there is 1 person infected today, and another tomorrow, for a total of two Deaths, the rate of change from day 1 to day 2 will be 1.0.
      • Also, if the next day has no Deaths reported, then the rate of change will stay at 1.0, however the change in that rate per day will be 0.

This will be important for how the Monte Carlo Algorithm is built.

2.3 Deeper look at Italys’ Death Rate of Change

Looking at this scatter plot, as the number of Deceased cases rise, the Rate of Change trends downwards. This means that Italy has a negative correlation between the actual change in cases per day and the rate of change. This is a good thing, as it could mean a few things:

  • The infection rate is starting to slow down.

  • The rate of change is starting to converge to a lower number.

    • Essentially, as the number of days increase, the variance in the changes of these rates will decrease.
      • Meaning the rate of change should get closer to a linear line, as the number of days increase.

This is important for the Monte Carlo Algorithm, as it needs to emulate this effect on its predictions.

Lets find out how well correlated the two actually are, using pearsons r method.

Before this is done, the outliers need to be taken out of the data, as some of the early rates as those are innacurate, for the reasons described before.

Now that that is done, lets check to see if the Rate of Change distribution is relatively normal.

The distribution here is the rate of changes after removing the outliers. There is a clear Positive Skew in this graph, meaning the bulk of the data is trending to the left.

We should see this trend increase after the Monte Carlo Algorithm is applied.

2.4 Shapiro-Wilk Test

Lets use a shapiro Wilk Test to check its normality. But first, I will explain what the test is:

*The null-hypothesis of this test is that the population is normally distributed. + Thus, if the p value is less than the chosen alpha level, - then the null hypothesis is rejected and there is evidence that the data tested are not normally distributed. -likewise, if the p value is greater than the alpha level, then the data is normally distributed!

  • It is regarded as being the best test for checking the significance the above null Hypothesis;
    • fun fact: this was tested by Monte Carlo Simulations.
  • \(W=\frac{(\sum\limits_{i=1}^{n}a_ix_{(i)})^2}{(\sum\limits_{i=1}^{n}x_i-\bar{x})^2}\)

In plain terms, this is the slope of the QQ plot, over the variance. As the slope of the QQ plot approaches the Variance, ie, the distribution reaches normality, W will approach 1.

Thus, a number close to 1 indicates a higher level of normality!

## 
##  Shapiro-Wilk normality test
## 
## data:  std_corr$Rate_Change
## W = 0.94503, p-value = 0.1483

Awesome! this tells us that the p value is 0.1483 which is greater than 0.05 (a 95% confidence interval)!

This is great news! As it means that the null hypothesis is rejected! Telling us two things:

  • The Change in Rates Distribution is not stastically different from a normal distribution.

  • The W value is 0.945 which is less than 1, but only by 0.05497,

    • This means that the slope of its QQ Plot is relatively close to 1, telling us this data is really close to a normal distribution.

In summary, this means that the Rate of Change, in the Deaths, can be used as a predictor for the Monte Carlo Simulator!

2.5 Pearsons r and r^2

Lets take a look into Pearsons r, this will tell us how correlated the data is, ie, how well fit the mean line is compared to the actual data.

As expected there is a negative correlation between the Rate of Change and Total cases, aswell as a negative correlation between the Rate of Change and the cases per day.

This makes sense, and indicates that as the amount of Cumulative Deaths increase the Rate of Change will start to decrease aswell. This is great news for a few reasons:

  • This ultimately suggests that the rate of Deaths per day is slowing down

  • This suggests that the Rate of Change is starting to normalize, causing less drastic changes in the over changes in these Rates.

    • Generally, the Rate of Change will start to converge to a single number.

The opposite is true with the positive r values. For exampe, the r value between the Cumulative Deaths and the Cases per day is 0.938! This highly suggests that as the CUmulative Deaths start to increase, so will the Number of Cases per day.

That is about all the information that Peasrsons r can provide, so lets take a loot at Pearsons r squared!

The r squared value can tell us much more, for example, there is a r^2 value of 0.88 between the Cumulative deaths and the Cases per day. This means 88.7% of the Cases per day can be explained by the Total cases, or vice versa! This goes on for all of the relatively high r^2 values, showing a strong correlation between a lot of the different calculations.

Let’s check if these values are statistically significant, to do that the following test needs to be done.

2.6 Testing the Correlations

Null Hypothesis:

  • There is not a significant linear relationship between the following categories:
    • Cumulative Deaths and Cases per day,
      • \(r^2=0.88\)
    • The Rate of Change and the Cumulative Deaths,
      • \(r^2=0.279\)
    • The Rate of Change and the Changes in those Rates,
      • \(r^2=0.469\)
  • Therefore, the regression line CANNOT be used to model a linear relationship in the population.

Alternate Hypotheses:

  • There is a significant linear relationship between the above categories.

This test requires the following formula:

  • \({t}=\frac{r*\sqrt[2]{n-2}}{\sqrt[2]{1-r^2}}\)
##                  Deaths     Change Rate_Change     D_RC_C
## Deaths              Inf 14.0777415    3.230726 0.01932788
## Change      14.07774150        Inf    2.035128 0.98035389
## Rate_Change  3.23072584  2.0351284         Inf 5.15919843
## D_RC_C       0.01932788  0.9803539    5.159198        Inf

These are the test statistics for each r value, now the below formula can be used to check if the r value is statistically significant.

  • \({-t} > r > t\)

Here is a matrix of the r values.

##                   Deaths     Change Rate_Change      D_RC_C
## Deaths       1.000000000  0.9381351  -0.5280149 0.003719627
## Change       0.938135074  1.0000000  -0.3646870 0.185398322
## Rate_Change -0.528014886 -0.3646870   1.0000000 0.704578921
## D_RC_C       0.003719627  0.1853983   0.7045789 1.000000000

This tells us that all of the correlations are statistically significant! Which means that all of the factors can be used as a predictor of the other.

This is very important for building the algorithm to the Monte Carlo Simulator.

2.7 Quick Notes: Infected Cases

There are a few things I want to point out about COVID-19:

  • The true Cumulative Infected Cases would consist of everybody that has, at some point, contracted COVID-19,
    • including those that are Deceased, Recovered, or Actively sick.
  • The median incubation period is 5 days.
    • This means that there is most likely a 5 day lag in the confirmed cases, as 50% of infected individuals will not experience any symptoms before the incubation period.
      • This is ontop of an innacurate Cumulative Confirmed Cases.
  • The average days from illness onset to death is 18 days.
    • This can help us estimate the amount of infections 23 days prior,
      • as it takes about 5 days for the illness onset to begin after the time of infection.

Using these facts, one can estimate the average time from point of infection to death is 23 days (5 incubation days + 18 days till death). This means that if someone dies today from COVID-19 it can be estimated that they got infected 23 days prior. This will help us later on for assessing the total amount of Infected Cases.

These are also the reasons I believe it is best to use the Deaths as the base predictor for the Monte Carlo Simulator.

SOURCES:

https://www.uptodate.com/contents/coronavirus-disease-2019-covid-19?source=history_widget

https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(20)30566-3/fulltext

3 Building the Monte Carlo Simulation

Now that the correlations have been tested against eachother, the process for the algorithm can begin!

This algorithm will be broken down into the following steps:

Great, now once the values are predicted, the algorithm will need to math the Change in Rates back to the next Rate of Change, then the Deaths per day, then finally the predicted Cumulative Deaths!

3.1 The math behind the Monte Carlo Simulation

  • \(f(t,r)=P_0*(1+r)^t\)

This function is being calculated one day at a time, which means the t value will be 1, giving us this:

  • \(f(1,r_1)=f(0,r_0)*(1+r_1)\)

In plain terms, the next Cumulative Death total is equal to the previous Cumulative Death total times, 1 plus the current rate.

This also leads to a few other things:

  • \(r_1=\Delta r_1+r_0\)

Meaning, current Rate is equal to the change in the current rate plus the old rate.

  • \(\Delta f(1,r_1)=f(0,r_0)*r_1\)

Using the above to get; the change in the Cumulative Deaths (deaths per day) is equal to the previous Cumulative Deaths time the current Rate.

  • \(f(1,r_1)=\Delta f(1,r_1)+f(0,r_0)\)

Thus, the current Cumulative Death is equal to the change in the Cumulative Death plus the previous Cumualtive Death total.

3.2 The Algorithm

To start, there are a few things to note about how I built the algorithm for this analysis:

  • It was built using the mean and standard deviation from the past 8 days,
    • what this means is, it will only look at the last 8 change in rates to base the next day off of.
    • Once it predicts a future rate, the oldest rate can be thrown away, and the algorithm can re-model the predictions mean and standard deviation,
      • this is ensure that the model changes in accordance with the predictions, otherwise there would start to be a more linear aspect to the predictions.
  • I have split this algorithm into multiple functions to stop at certain points, in order to explain whats going on.

Lets dive in by taking a look at how the rates of change work!

3.3 Pre-Prediction Analysis

Here I will go into a more in-depth analysis of the Change in Rates, in order to prep it for the algorithm!

To start, the Rate of Change can be formulated by:

  • \(r_1=\frac{\Delta f(t,r_1)}{P_0}\)

In plain terms, this formula takes the change per day of the deaths, and divides that by the previous days Cumulative Deaths! Making the Deaths per day into a percentage of the Cumulative Deaths!

##          Date Deaths Deaths_PD Deaths_RC
## 27 2020-03-19   3405       427 0.1433848
## 28 2020-03-20   4032       627 0.1841410
## 29 2020-03-21   4825       793 0.1966766
## 30 2020-03-22   5476       651 0.1349223
## 31 2020-03-23   6077       601 0.1097516

Now that this step is complete, the change these rates can be calculated. Essentially looking at how much the rates are fluctiating over time. This can be done simply by taking the first deaths rate of change and subtracting the second days rate of change from that. This will mean if the rate goes up, then the change in that rate will go up as well!

  • \(\Delta r= r_1-r_0\)
##          Date Deaths Deaths_PD Deaths_RC      D_RC_C
## 27 2020-03-19   3405       427 0.1433848 -0.04638745
## 28 2020-03-20   4032       627 0.1841410  0.04075615
## 29 2020-03-21   4825       793 0.1966766  0.01253562
## 30 2020-03-22   5476       651 0.1349223 -0.06175431
## 31 2020-03-23   6077       601 0.1097516 -0.02517064

Great, here are Italys’ cumulative Deaths, Deaths per day, Deaths rate of change, and the change in the rate of change. All of this information will be used to work backwards at the end, to fill in the predictions.

First, lets take a look at the distribution of last 8 days’ Change in Rate.

This is the distribution of the rate of change, from a visual standpoint, there seems to be a slight dip at 0, suggesting that the changes skip over 0. This is a good thing, as the Change in Rate should always be changing, as long as the Cumulative Deaths are changing.

Lets do another Shapiro Wilk test on this, to check its normality.

## 
##  Shapiro-Wilk normality test
## 
## data:  italy$D_RC_C
## W = 0.89634, p-value = 0.2677

Exactly what was needed, the p value is 0.2677 > 0.05, meaning the distribution is normal enough to use as a predictor!

3.4 Building the Algorithm

Now that the Change in Rate has been calculated and tested, the prediction model can start to be built!

This process will be walked over in the code chunks, as I am building the process by hand. The overall process of the algorithm was mentioned at the start of this section.

######################
### functions used inside prediction model function
######################

# FUNCTION for getting the rate of change
get_rc <- function(death_rc_n1, change){
  #rc = lag death_rc + change
  rc=death_rc_n1+change
  
  if (rc <0){
    rc=0
  }
  return(rc)
}

# FUNCTION for getting the Deaths per day
get_pd <- function(deaths_n1, deaths_rc){
  #lag death*roc = pd
  
  pd = (deaths_rc*deaths_n1)
  
  return(ceiling(pd))
}

# FUNCTION for getting the Cumulative Deaths
get_death <- function(lag_death = lag_death, dpd){
  
  death = lag_death+dpd
  return(ceiling(death))
}

### FUNCTION used for prediction model
prediction_model <- function(data, days, trials, split, curr_d){
  
  #getting the current day inside the data passed
  curr_d = max(data[data$D_RC_C != 0,"Date"])
  
  
  for (n in 1:trials){ #loop for num of simulations
    temp <- data %>% #create a ned df for each sim
      subset(select = c(Date, D_RC_C))

    for (i in 1:days){ #loop for num of pred days
      mu <- temp %>% #grabbing mean
        subset(select = D_RC_C) %>% 
        slice(nrow(temp)-(nrow(temp)-split):nrow(temp)+i) %>% 
        as.matrix() %>% 
        mean()
      
      sd <- temp %>% #grabbing standard deviation
        subset(select = D_RC_C) %>% 
        slice(nrow(temp)-(nrow(temp)-split):nrow(temp)+i) %>% 
        as.matrix() %>% 
        sd()
      
      set.seed(i*13*n) #setting a changing seed
      
      temp <- temp %>%  #adding a new row to the df
        bind_rows(data.frame(Date = as.Date(curr_d, format = "%Y-%m-%d")+i, D_RC_C = rnorm(1, mu, sd))) %>% 
        mutate(
          Trial = n #adding the trial number
        )
    }
    
    if (n == 1){ #to create the new trial data
      trial_data <- temp
    }
    else{ #to combine the trial data
      trial_data <- temp %>% 
        bind_rows(trial_data)
    }
  }
  
  return (trial_data) #FINISHED
}

#Grabbing the current day for usage
curr_d <- max(italy[italy$D_RC_C != 0,"Date"])

TRIALS <- 100 #set number of trials
DAYS <- 10 #set number of days
SPLIT <- 8 #set number of window

italy <- italy %>% 
  prediction_model(days = DAYS, trials = TRIALS, split = SPLIT, curr_d) %>% 
  group_by(Trial, Date) %>% 
  slice(n()) %>% 
  ungroup() %>% 
  mutate(
    Deaths = NA,
    Deaths_PD = NA,
    Deaths_RC = NA,
    Country = "Italy"
  )

This graph shows just one Trial of the Change in Rates. For this test model, I ran a total of 20 simulations, so keep in mind, each individual simulation will look different.

As shown, in the beginning of the data, where the Cumulative Deaths were still very low, the variation of those changes were very drastic, over time the Changes started to slow down, so this algorithm is looking at the last 8 days of the actual data, and keeping the revolving door moving, it should start to even out over time, converging to 0. When the Changes in the Rates converge to 0, this means that the Deaths per day are staying stagnant, presumably this will happen when the disease has done a few things:

  • An important note is: I put a hard cap on the Rate of Change. If the change dropped below 0, I prevented the algorithm from dropping the rate below 0.
    • This is due to the fact that one cannot LOSE values in a cumulative sum,
      • losing values would indicate a negative rate of change.
    • Relating this to real life, this could possibly mean a lull in the virus, or a resurgance if the changes pick back up.
      • This is an unlikely scenario, as we are adjusting the parameters of the predictor, as the number of days increase.
  • The Rate of Deaths has stalled at some number,
    • meaning the Rate of Death is staying at a constant rate.
  • Hopefully the Rate of Death converges to 0,
    • this would mean that COVID-19 heading toward an eventual stop.
  • The Rate of Deaths could converge to a steady rate,
    • this would suggest that COVID-19 is on a steady track, either upwards or downwards.

Now that all the Changes in Rates have been predicted, lets’ run the algorithm to calculate backwards to fill in the blanks!

Awesome, now all the blanks are filled in marking the end of the Algorithms’ job! Lets start with a quick Analaysis of these results!

4 Analysis of the Monte Carlo Algorithm

It is important to note a few things about Monte Carlo Algorithms:

Remember: You can deselect the actual data to get a closer look at the simulation runs.

As shown, these are the 100 simulations for a 10 day forecast of Italys’ Cumulative Deaths. There are a few things to note from this graph:

Lets take a deeper look into these numbers.

Here the clustering of the simulations is apparant. It seems the farther the forecast, generally the less clustered it becomes. This means that the Variance is growing as the algorithm predicts each day, something that is expected as this is using a Markov Technique.

A few notes about this graph:

Lets take a look at the distribution of these indivdual days!

Remember: You can deselect the simulation days to get a better look at individual days, or to compare a few days.

This shows what exactly is going on with each day that the algorithm predicts.

Now lets take a look the individual rates!

hc <- italy %>% 
  subset(Date >= curr_d) %>% 
  hchart(type = "scatter", hcaes(x=Date,y=Deaths_RC, color = Trial),
         name = "Simulation")
  
val <- og %>% 
  subset(Date == curr_d, select = Deaths_RC) %>% 
  as.numeric()

y=0
n=0

for (rate in italy$Deaths_RC){
  if(rate < val){
    y=y+1 #less than last known rate
  }
  else{
    n=n+1 #greater than last known rate
  }
}

#cat(y," ",n) #unlock to get numbers

hc <- hc %>% 
  hc_xAxis(type = "datetime", dateTimeLabelFormats = list(day = '%d/%b')) %>% 
  hc_add_series(name = "Actual Data", data = og, type = "line", hcaes(x=Date, y=Deaths_RC), color = "black") %>% 
  hc_yAxis(title = list(text = "Rate of Change in Deaths"),
    plotLines = list(list(
      value = val,
      color = '#ff0000',
      width = 3,
      zIndex = 4,
      label = list(text = "Last Known Rate",
                   style = list( color = '#ff0000', fontWeight = 'bold' ))))) %>% 
  hc_add_theme(hc_theme_flat()) %>% 
  hc_title(text = "Italys' Rate of Change in Cumulative Deaths",
           align = "left") %>% 
  hc_subtitle(text = "#Simulations = 100, #days = 10, using last 8 days mu/sd",
              align = "left") %>% 
  hc_yAxis(labels = list(format = "{value}"), title = list(text = "Rate of Change of Cumulative Deaths")) %>% 
  hc_xAxis(title = list(text = "Date"),
           plotBands = list(
             list(
               label = list(text = "This is the Predicted Data"),
               color = "lightblue",
               from = datetime_to_timestamp(as.Date(curr_d)+1),
               to = datetime_to_timestamp(as.Date('2020-06-02'))
             )))%>% 
  hc_legend(align = "left") %>% 
  hc_tooltip(formatter = JS("function(){
                            return ('Date: ' + this.point.Date + ' <br> Rate of Change: ' + this.y +
' <br> Deaths per day: ' + this.point.Deaths_PD +
' <br> Simulation: ' + this.point.Trial)}"))
                
hc

Remember, these rates of changes were hard capped to 0.

Now it is obvious that there is a clear split between rates that are trending downwards and rates that are climbing upwards! This is great news, as it can generally imply that the Rate of Change will start to climb down. ~74.8% (823 rates < 0.109) of the rates are below the Last Known Rate of 0.109, while ~25.2% (277 rates > 0.109) of the rates are above the Last Known Rate. This can suggest a few things:

4.1 Running different hyperparameters

One thing that is possible with this algorithm, is changing the hyperparameters.

  • The number of Simulations.
  • The number of Days Predicted.
  • The window used for the mean and standard deviation.

I am going to import a dataset of this algorithm, that was ran with more different hyperparameters. I want to specifically limit the range to a 6 day forecast, as the Algorithm gets more inacurate with each consecutive day.

Remember: You can deselect the the Windows in the legend, to get a closer comparison between each hyperparameter.

This shows that a revolving mean and standard deviation can have completely different effects on the model, depending on what parameters are chosen.

This chart shows variation of the data, that appears to explode when anything over 10 is chosen. * This is most likely due to the fact that the rates were vastly different in the beginning stages of the spread, as the overall numbers are lower. + Which suggests that there should be a focus on looking at a window that is less than 10, for better results.

There is one thing I want to point out, when observing the window of 30 days, the median Rate of Change is the highest one out of all other options. This again suggests that the model should be used with a Window of less than 10.

Lets take a look deeper look at the simulations with windows that are less than 10 days.

Remember: You can deselect the the Windows in the legend, to get a closer comparison between each hyperparameter.

It seems that nearly every simulation showed a median change that is negative. This would highly indicate that the Change in Deaths per day is slowing down, meaning that the Cumulative Deaths is starting to slow down!

4.2 Comparing the Simulation to Real Life

lets take a look at the mean Deaths for the Trials, and compare that with the actual data that is now available for italy!

Remember: You can deselect the the Windows in the legend, to get a closer comparison between each hyperparameter.

This chart shows the actual Cumulative deaths for March 24th - 29th, as well as the simulated values. This suggests a few things about the simulations:

  • The closest hyperparameter to match the actual data is the Window of 9 days.
    • Simulating with an error within 800 Deaths per day!
      • This is an error of 7.18%
      • $ = $
  • The simulations seem to overall get worse over time
    • Again, this suggests that the model should only be used for short term predictions.

5 Finding the Infected Cases

There are a few things to note about the Infected Cases.

\({Deaths}={Infected}*{MortalityRate}\)

This is the formula to find the number of Deaths from any given population of infected individuals.

Taking the derivative of this formula with respect to time leads into a two options:

If the derivative is taken with Mortality Rate as a variable we get the following:

In plain terms, the change in Deaths is equal to the change in infections, time the Mortality Rate, plus, the change in Mortality Rate, times the infections.

There is one major problem with this, there is no way to get an accurate estimate on the Mortality Rate, which means it is impossible to get an accurate estimate on the change in Mortality Rate. So, while this would be the more realistic option, I do not believe it is feasable with the data that is available, as there is no way to get an accurate Mortality Rate without individual studies.

However, taking the derivative with Mortality Rate as a constant gives the following:

which also implys.

In plain terms, the change in Deaths is equal to the change in the infections, times the Mortality Rate. Noting that the change in infections is equal to the change in Deaths, divided by the Mortality Rate.

This means that with a constant Mortality Rate, the change in Deaths will also be the same as the change in Infections! As the rate is just scaling the numbers. This means that the change in Infections can be estimated by using the change in Deaths, scaling with the Mortality Rate afterwards!

While there is little chance the Mortality Rate is a constant value, taking a look at the bigger picture, it can be estimated what the Infected Cases might look like. This will not give an accurate per day representation, however it should give a more accurate picture overall.

To counteract this, I will find the estimated Infected Cases using multiple Mortality Rates:

Remember: You can deselect the Mortality Rates in the legend to get a better comparison.

This graph is made using the original data from (2-22-2020 to 3-23-2020), as well as a 6 day simulation bringing the final predicted Cumulative Deaths to the date of (3-29-2020). However, the graph stops at March 7th because there is a 23 day lag on the infected cases (5 days incubation + 18 day illness onset to death). This leaves a massive unknown when dealing with the true Infected Cases, especially for any recent Dates.

Despite this, there are a few things we can gather from this:

6 Conclusion

While a Monte Carlo Algorithm is not traditionally used to predict the future, the Algorithm can be used to make informed inferences about what is to come.

The Algorithm clearly showed a drop in the rate of change for the Confirmed Deaths, which could be indicitive of a few things happening:

This information can be used to infer on the Infected Cases, in a few ways: